Using two-level stable storage for efficient checkpointing - Software, IEE Proceedings- [see also Software Engineering, IEE Proceedings]

نویسنده

  • L. M. Silva J. G. Silva
چکیده

Checkpointing and rollback recovery is a very effective technique to tolerate the occurrence of failures. Usually, checkpoint data is saved on disk, however, in some situations the time to write the data to disk can represent a considerable performance overhead. Alternative solutions would make use of main memory to maintain the checkpoint data. The paper starts by presenting two main memory checkpointing schemes: neighbour based and parity checkpointing. Both schemes have been implemented and evaluated in a commercial parallel machine. The results show that neighbour based checkpointing presents a very low performance overhead and assures a fast recovery for partial failures. However, it is not able to tolerate multiple and total failures of the system. To solve this shortcoming the authors propose a two-level stable storage integrating the use of neighbour based with disk based checkpointing. This approach combines the advantages of the two schemes: the efficiency of diskless checkpointing with the high reliability of disk based checkpointing.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Optimal restart times for moments of completion time - Software, IEE Proceedings- [see also Software Engineering, IEE Proceedings]

Restart is an application-level technique that speeds up jobs with highly variable completion times. The authors present an efficient iterative algorithm to determine the restart strategy that minimises higher moments of completion time, when the total number of restarts is finite. They demonstrate its computational efficiency in comparison with alternative algorithms. They also discuss fast ap...

متن کامل

FADI: a fault tolerant environment for open distributed computing

FADI is a complete programming environment that serves the reliable execution of distributed application programs. FADI encompasses all aspects of modern fault-tolerant distributed computing. The built-in usertransparent error detection mechanism covers processor node crashes and hardware transient failures. The mechanism also integrates user-assisted error checks into the system failure model....

متن کامل

Personalised Grid service discovery - Software, IEE Proceedings- [see also Software Engineering, IEE Proceedings]

The authors take a broad view that ultimately Gridor Web-services must be located via personalised, semantic-rich discovery processes. They argue that such processes must rely on the storage of arbitrary metadata about services that originates from both service providers and service users. Examples of such metadata are reliability metrics, quality of service data or semantic service description...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004